Rachel the Robo Caller - Analysis

At DEF CON 22, the FTC ran a contest to help mitigate robocalls. There were three rounds, the last of which used a set of call records collected from a robocall honeypot to determine whether a caller was a robocaller. See Parts I and II of the contest for details on robocaller honeypots.

The FTC gave us two data sets, each recording phone calls from one "person" to another along with the date and time. Both collections were uniquely randomized, but the area code and subscriber number portions were kept the same.

This Notebook details initial exploration of the data. For the follow up on predictions, check out Modeling Rachel the Robocaller.


In [21]:
from IPython.display import Image
Image("http://www.ftc.gov/system/files/attachments/zapping-rachel/zapping-rachel-contest.jpg")


Out[21]: [image: Zapping Rachel contest banner]

In [24]:
%matplotlib inline
# Standard toolkits in pydata land
import pandas as pd
import numpy as np

In [2]:
# Neat little library that is a partial port of Google's libphonenumber
import phonenumbers
from phonenumbers import geocoder
# from phonenumbers import carrier
from phonenumbers import timezone

In [3]:
# First pass will use a Random Forest; more on this later
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier

In [4]:
def read_FTC(dataset):
    # Keep TO/FROM as strings (they're identifiers, not quantities) and let the
    # converter turn the 'X' markers into booleans; also listing 'LIKELY ROBOCALL'
    # in dtype would be ignored in favor of the converter anyway.
    return pd.read_csv(dataset,
                       parse_dates=["DATE/TIME"],
                       converters={'LIKELY ROBOCALL': lambda val: val == 'X'},
                       dtype={'TO': str, 'FROM': str})

# This assumes you have the data locally
labeled_data = read_FTC("FTC-DEFCON Data Set 1.csv")
unlabeled_data = read_FTC("FTC-DEFCON Data Set 2.csv")

In [5]:
labeled_data.head()


Out[5]:
TO FROM DATE/TIME LIKELY ROBOCALL
0 17866291260 13055793696 2014-04-01 False
1 14027826713 12063339487 2014-04-01 True
2 17083187970 12246108402 2014-04-01 False
3 17733095581 13035009570 2014-04-01 True
4 19188765408 16153878533 2014-04-01 True

In [6]:
unlabeled_data.head()


Out[6]:
TO FROM DATE/TIME LIKELY ROBOCALL
0 16163847430 13236069958 2014-06-01 False
1 12025176283 12029867020 2014-06-01 False
2 18663049187 15159256650 2014-06-01 False
3 15594157085 16199247140 2014-06-01 False
4 18582407865 19492012595 2014-06-01 False

First things to note right off the bat:

  1. We could let phonenumbers parse the numbers out for us and cache the results (see the sketch after this list)
  2. The phone numbers are not really numeric values and should be treated as categorical data
  3. The phone number should be broken up into individual categorical features, likely:
    • Area code
    • Carrier/Subscriber
    • Not the last 4 digits though, as they are randomized and only unique when paired with the rest of the number
  4. It is unknown whose timezone the date and time is in. They could (should?) be normalized for each calling side
  5. Time zone can be extracted from the phone numbers themselves, but it says nothing about where the caller actually is
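
On point 1, caching is cheap because the randomized numbers repeat across many rows. A minimal memoization sketch (hypothetical; the cells below just parse every row directly, and this uses the '+'-prefixed form we arrive at a few cells down):

_parse_cache = {}

def cached_parse(number):
    # Parse each distinct number once; repeated numbers hit the cache
    if number not in _parse_cache:
        _parse_cache[number] = phonenumbers.parse("+" + number, None)
    return _parse_cache[number]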

Let's see how the phonenumbers library works


In [7]:
# Pulling a random number from the data set
fake_number = phonenumbers.parse("19188765408")


---------------------------------------------------------------------------
NumberParseException                      Traceback (most recent call last)
<ipython-input-7-5acd0111a6aa> in <module>()
      1 # Pulling a random number from the data set
----> 2 fake_number = phonenumbers.parse("19188765408")

/usr/local/lib/python2.7/dist-packages/phonenumbers/phonenumberutil.pyc in parse(number, region, keep_raw_input, numobj, _check_region)
   2453     if _check_region and not _check_region_for_parsing(national_number, region):
   2454         raise NumberParseException(NumberParseException.INVALID_COUNTRY_CODE,
-> 2455                                    "Missing or invalid default region.")
   2456     if keep_raw_input:
   2457         numobj.raw_input = number

NumberParseException: (0) Missing or invalid default region.

In [8]:
# Looking back at their docs, a leading '+' and a region of None will make phonenumbers attempt to detect region, etc.
fake_number = phonenumbers.parse("+19188765408", None)
fake_number


Out[8]:
PhoneNumber(country_code=1, national_number=9188765408, extension=None, italian_leading_zero=False, number_of_leading_zeros=None, country_code_source=None, preferred_domestic_carrier_code=None)

In [9]:
fake_number.country_code


Out[9]:
1

In [10]:
phonenumbers.is_valid_number(fake_number)


Out[10]:
True

In [11]:
geocoder.description_for_number(fake_number, "EN")


Out[11]:
'Oklahoma'

In [12]:
timezone.time_zones_for_number(fake_number)


Out[12]:
('America/Chicago',)

Picking out features for the numbers themselves


In [13]:
# Do they all start with a 1?
print(labeled_data["TO"].str.get(0).unique())
print(labeled_data["FROM"].str.get(0).unique())
print(unlabeled_data["TO"].str.get(0).unique())
print(unlabeled_data["FROM"].str.get(0).unique())


['1']
['1']
['1']
['1']

Yup! This means we're using the North American Numbering Plan.

The NANP is a system of numbering plan areas (NPA) using telephone numbers consisting of a three-digit area code, a three-digit central office code, and a four-digit station number. Through this plan, telephone calls can be directed to particular regions of the larger NANP public switched telephone network (PSTN), where they are further routed by the local networks. The NANP is administered by the North American Numbering Plan Administration (NANPA), a service operated by Neustar corporation. The international calling code for the NANP is 1.

Our phone number structure is then CAAAOOONNNN, where C is the country code, AAA is the area code, OOO is the "central office" code (does this come from the old operator days?), and NNNN is the station number, the remaining digits unique to a caller. The station numbers are randomized though, so we'll ignore NNNN as a feature on its own.
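
As a quick worked split (a sketch using a number pulled from the data set), plain string slicing does it:

number = "19188765408"
country_code = number[0]      # '1'
area_code = number[1:4]       # '918'
office_code = number[4:7]     # '876'
station_number = number[7:]   # '5408' -- randomized, so useless on its own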

Parsing the area code and central office code is trivial with Pandas semantics. A few utilities in the phonenumbers library might still help in a little bit though, namely:

  • geocoder.description_for_number
  • phonenumbers.is_valid_number
  • timezone.time_zones_for_number

In [14]:
# Let's go ahead and parse all of them.
# We'll parse with a leading '+' since the numbers are listed with a leading country code,
# and leave the second argument as None so that the phonenumbers package has to detect the region itself.

labeled_data["TO_PARSED"] = labeled_data["TO"].apply(lambda row: phonenumbers.parse("+" + row, None))
labeled_data["FROM_PARSED"] = labeled_data["FROM"].apply(lambda row: phonenumbers.parse("+" + row, None))

In [15]:
labeled_data["TO_VALID"] = labeled_data["TO_PARSED"].apply(lambda ph: phonenumbers.is_valid_number(ph))
labeled_data["FROM_VALID"] = labeled_data["FROM_PARSED"].apply(lambda ph: phonenumbers.is_valid_number(ph))

In [16]:
labeled_data.TO_VALID.unique()


Out[16]:
array([True], dtype=object)

In [17]:
labeled_data.FROM_VALID.unique()


Out[17]:
array([True, False], dtype=object)

There are invalid numbers in the FROM column?!? What proportion of those come from the likely robocallers?


In [18]:
from_valid_v_robocall = pd.crosstab([labeled_data.FROM_VALID], labeled_data['LIKELY ROBOCALL'])
from_valid_v_robocall.plot(kind='bar', stacked=True, grid=False, color=["blue", "red"])
from_valid_v_robocall


Out[18]:
LIKELY ROBOCALL False True
FROM_VALID
False 839 356
True 91927 44241

In [25]:
from_valid_v_robocall.div(from_valid_v_robocall.sum(1).astype(float), axis=0).plot(kind='barh', stacked=True, color=["blue", "red"])


Out[25]:
<matplotlib.axes.AxesSubplot at 0x7fd32400ef90>

Come to think of it, maybe FROM_VALID is a no-good, bad feature, since the numbers are randomized but not necessarily kept valid by the FTC. Hmmm... Moving on.

While we're at it, might as well make a utility function to do our cross tabulation plots against likely robocalls.


In [26]:
def explore_feature(df, name):
    feature_v_robocall = pd.crosstab([df[name]], df['LIKELY ROBOCALL'])
    feature_v_robocall.plot(kind='bar', stacked=True, grid=False, color=["blue", "red"])
    fvr_div = feature_v_robocall.div(feature_v_robocall.sum(1).astype(float), axis=0)
    fvr_div.plot(kind='barh', stacked=True, color=["blue", "red"])
    return feature_v_robocall

In [27]:
labeled_data["TO_DESCRIPTION"] = labeled_data["TO_PARSED"].apply(lambda ph: geocoder.description_for_number(ph, "EN"))
labeled_data["FROM_DESCRIPTION"] = labeled_data["FROM_PARSED"].apply(lambda ph: geocoder.description_for_number(ph, "EN"))

def get_time_zone(ph):
    tz = timezone.time_zones_for_number(ph)

labeled_data["TO_TIMEZONE"] = labeled_data["TO_PARSED"].apply(lambda ph: timezone.time_zones_for_number(ph))
labeled_data["FROM_TIMEZONE"] = labeled_data["FROM_PARSED"].apply(lambda ph: timezone.time_zones_for_number(ph))

In [28]:
labeled_data["FROM_TIMEZONE"].unique()


Out[28]:
array([('America/New_York',), ('America/Los_Angeles',),
       ('America/Chicago',), ('America/Denver',),
       ('America/Anguilla', 'America/Antigua', 'America/Barbados', 'America/Cayman', 'America/Chicago', 'America/Denver', 'America/Dominica', 'America/Edmonton', 'America/Grand_Turk', 'America/Grenada', 'America/Halifax', 'America/Jamaica', 'America/Juneau', 'America/Los_Angeles', 'America/Lower_Princes', 'America/Montserrat', 'America/Nassau', 'America/New_York', 'America/Port_of_Spain', 'America/Puerto_Rico', 'America/St_Johns', 'America/St_Kitts', 'America/St_Lucia', 'America/St_Thomas', 'America/St_Vincent', 'America/Toronto', 'America/Tortola', 'America/Vancouver', 'America/Winnipeg', 'Atlantic/Bermuda', 'Pacific/Guam', 'Pacific/Honolulu', 'Pacific/Pago_Pago', 'Pacific/Saipan'),
       (u'Etc/Unknown',), ('America/Toronto',), ('America/Winnipeg',),
       ('America/Edmonton',), ('America/Juneau',), ('Pacific/Honolulu',),
       ('America/Halifax',), ('America/Vancouver',), ('America/Jamaica',),
       ('America/St_Johns',), ('America/Puerto_Rico',),
       ('America/Grenada',), ('America/St_Thomas',), ('Pacific/Guam',),
       ('Pacific/Saipan',), ('America/Nassau',),
       ('America/Port_of_Spain',), ('America/Dominica',)], dtype=object)

In [29]:
[len(x) for x in labeled_data["FROM_TIMEZONE"].unique()]


Out[29]:
[1, 1, 1, 1, 34, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

In [30]:
# For ease of plotting, I'm collapsing that long tuple into a single sentinel
# value and converting all of these tuples into plain strings
def get_time_zone(ph):
    # Playing fast and loose here since only one grouping had more than one timezone in it
    tz = timezone.time_zones_for_number(ph)
    if len(tz) > 1:
        tz = ("Etc/Lots",)
    return tz[0]

labeled_data["TO_TIMEZONE"] = labeled_data["TO_PARSED"].apply(lambda ph: get_time_zone(ph))
labeled_data["FROM_TIMEZONE"] = labeled_data["FROM_PARSED"].apply(lambda ph: get_time_zone(ph))

Wow, one of those timezones is pretty much unknown.


In [31]:
labeled_data[labeled_data["FROM_TIMEZONE"] == "Etc/Lots"].groupby("LIKELY ROBOCALL").aggregate(sum)


Out[31]:
TO_VALID FROM_VALID
LIKELY ROBOCALL
False 10817 10817
True 11225 11225

In [32]:
explore_feature(labeled_data, 'FROM_TIMEZONE')


Out[32]:
LIKELY ROBOCALL False True
FROM_TIMEZONE
America/Chicago 18604 6350
America/Denver 5350 1805
America/Dominica 0 1
America/Edmonton 203 15
America/Grenada 1 0
America/Halifax 248 0
America/Jamaica 50 0
America/Juneau 66 0
America/Los_Angeles 20225 11578
America/Nassau 1 0
America/New_York 34589 12832
America/Port_of_Spain 1 0
America/Puerto_Rico 44 0
America/St_Johns 46 0
America/St_Thomas 4 0
America/Toronto 1192 430
America/Vancouver 210 2
America/Winnipeg 143 0
Etc/Lots 10817 11225
Etc/Unknown 839 356
Pacific/Guam 3 0
Pacific/Honolulu 114 3
Pacific/Saipan 16 0

That America/Dominica one looks interesting on the last plot (percentage of likely robocall by FROM_TIMEZONE), but there is only 1 data point. That "Etc/Lots" timezone is probably interesting though.

In reality, the timezone is being pulled out from the country code + the area code. We should just use Pandas semantics on the area code.


In [33]:
# Extract the area code using slicing since they are all regular US numbers
labeled_data["TO_AREA_CODE"] = labeled_data["TO"].str.slice(1,4)
labeled_data["FROM_AREA_CODE"] = labeled_data["FROM"].str.slice(1,4)

In [34]:
labeled_data.TO_AREA_CODE.describe()


Out[34]:
count     137363
unique       300
top          888
freq       11316
dtype: object

In [35]:
labeled_data.FROM_AREA_CODE.describe()


Out[35]:
count     137363
unique       430
top          800
freq        5843
dtype: object

In [36]:
to_area_code_v_likely_robocall = explore_feature(labeled_data, "TO_AREA_CODE")


Methinks there are too many area codes to visualize that. Let's look at just the subset that is potentially interesting.


In [37]:
area_code_div = to_area_code_v_likely_robocall.div(to_area_code_v_likely_robocall.sum(1).astype(float), axis=0)
sample_size = to_area_code_v_likely_robocall.sum(1)
threshold = .20
min_samples = 10
threshold_true_robo = (sample_size > min_samples) & ((area_code_div[True] < threshold) | (area_code_div[True] > (1 - threshold)))

thresholded_area_robo = area_code_div[threshold_true_robo]

to_area_code_v_likely_robocall[threshold_true_robo].plot(kind='bar', stacked=True, grid=False, color=["blue", "red"])
thresholded_area_robo.plot(kind='barh', stacked=True, color=["blue", "red"])
thresholded_area_robo


Out[37]:
LIKELY ROBOCALL False True
TO_AREA_CODE
202 0.822464 0.177536
204 1.000000 0.000000
212 0.851852 0.148148
213 0.837956 0.162044
225 0.820144 0.179856
226 1.000000 0.000000
276 0.828947 0.171053
289 0.982857 0.017143
303 0.822581 0.177419
312 0.802326 0.197674
314 0.814085 0.185915
319 0.876033 0.123967
418 1.000000 0.000000
519 0.959596 0.040404
615 0.810651 0.189349
646 0.827586 0.172414
647 0.923077 0.076923
705 1.000000 0.000000
754 1.000000 0.000000
770 0.812030 0.187970
772 0.843750 0.156250
780 0.981481 0.018519
800 1.000000 0.000000
808 0.806723 0.193277
828 0.913363 0.086637
855 0.904875 0.095125
866 0.986063 0.013937
877 0.993841 0.006159
888 0.991340 0.008660
906 0.805556 0.194444
915 0.822917 0.177083
949 0.897206 0.102794
980 0.818182 0.181818

In [38]:
to_area_code_v_likely_robocall[(area_code_div[True] > (1 - threshold))]


Out[38]:
LIKELY ROBOCALL False True
TO_AREA_CODE
250 0 1
331 1 6

Do it again with office codes?


In [39]:
# Extract the office code using slicing since they are all regular US numbers
#  labeled_data["TO_OFFICE_CODE"] = labeled_data["TO"].str.slice(4,7)
#  labeled_data["FROM_OFFICE_CODE"] = labeled_data["FROM"].str.slice(4,7)

#  Wait a second, these office codes need to be paired with their area codes. We'll have to include those.
labeled_data["TO_OFFICE_CODE"] = labeled_data["TO"].str.slice(1,7)
labeled_data["FROM_OFFICE_CODE"] = labeled_data["FROM"].str.slice(1,7)

In [40]:
# This is going to have the same (and worse) issue that exploring the area code did.
# We'll create a thresholded utility function here

def explore_thresholded_feature(df, name, threshold=.20, min_samples=10):
    feature_v_robocall = pd.crosstab([df[name]], df['LIKELY ROBOCALL'])
    proportionated_feature = feature_v_robocall.div(feature_v_robocall.sum(1).astype(float), axis=0)
    sample_size = feature_v_robocall.sum(1)
    
    threshold_true_robo = (sample_size > min_samples) & ((proportionated_feature[True] < threshold) | (proportionated_feature[True] > (1 - threshold)))
    
    thresholded_feature_v_robocall = feature_v_robocall[threshold_true_robo]
    
    thresholded_feature_v_robocall.plot(kind='barh', stacked=True, color=["blue", "red"])
    proportionated_feature[threshold_true_robo].plot(kind='barh', stacked=True, color=["blue", "red"])
    
    return thresholded_feature_v_robocall



explore_thresholded_feature(labeled_data, "TO_OFFICE_CODE", threshold=.08, min_samples=25)


Out[40]:
LIKELY ROBOCALL False True
TO_OFFICE_CODE
201266 47 0
201453 102 0
201676 34 0
201693 37 1
202640 25 2
203416 29 0
203802 37 0
206496 465 24
213344 239 2
213785 38 2
215717 3 41
229256 2 40
239201 64 0
240297 39 1
281661 1 41
289814 162 0
301658 64 1
303848 41 0
304527 2 33
304982 55 0
305548 2 33
305809 99 0
314282 36 1
314714 27 0
314888 26 1
316313 0 32
317653 43 1
318497 36 1
319313 40 0
319540 63 0
... ... ...
888803 48 0
888814 34 0
888815 35 0
888821 38 0
888834 43 0
888864 40 0
888870 98 0
888875 37 0
888885 33 1
888919 29 0
888931 36 0
888966 69 0
888979 42 0
888997 123 0
888998 2632 1
906428 35 0
906563 30 0
909275 103 8
909361 27 1
912226 29 2
919289 36 0
919636 25 2
925575 213 4
937595 66 4
940202 33 0
949528 1418 7
951225 2 29
954621 36 1
972201 86 2
973273 120 8

344 rows × 2 columns

Arg. Still not really easy to look at.

I did notice a few things though, namely that some area+office numbers actually had a much higher proportion of robocallers. Plenty of others have no robocallers at all; could those be exchanges already populated by real people, with no room for additional numbers?

I'm tending towards using Random Forests to classify this data. How well will that work when there aren't many samples for a given category?
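
One mitigation worth sketching (an aside, not the final model; the feature choice here is just illustrative) is to label-encode the category and require a minimum leaf size, so categories with only a handful of rows can't carve out their own pure leaves:

# Encode the office code as integers and require at least 5 samples per leaf
le = preprocessing.LabelEncoder()
X = le.fit_transform(labeled_data["TO_OFFICE_CODE"]).reshape(-1, 1)
y = labeled_data["LIKELY ROBOCALL"].values
clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5).fit(X, y)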

Let's make a different version of that thresholded function now that lets you choose the direction of the threshold.


In [41]:
def explore_thresholded_feature(df, name, threshold=.20, min_samples=10, tend_toward_robocallers=True):
    feature_v_robocall = pd.crosstab([df[name]], df['LIKELY ROBOCALL'])
    proportionated_feature = feature_v_robocall.div(feature_v_robocall.sum(1).astype(float), axis=0)
    sample_size = feature_v_robocall.sum(1)
    
    # Seeking those with LOTS of robo callers
    threshold_true_robo = (proportionated_feature[True] > (1 - threshold))
    
    # Conditionally look at those that tend not to have robocallers
    if not tend_toward_robocallers:
        threshold_true_robo |= proportionated_feature[True] < threshold
    
    # Limit by number of samples available
    threshold_true_robo &= (sample_size > min_samples)
    
    thresholded_feature_v_robocall = feature_v_robocall[threshold_true_robo]
    
    thresholded_feature_v_robocall.plot(kind='barh', stacked=True, color=["blue", "red"])
    proportionated_feature[threshold_true_robo].plot(kind='barh', stacked=True, color=["blue", "red"])
    
    return thresholded_feature_v_robocall

explore_thresholded_feature(labeled_data, "TO_OFFICE_CODE", threshold=.08, min_samples=25)


Out[41]:
LIKELY ROBOCALL False True
TO_OFFICE_CODE
215717 3 41
229256 2 40
281661 1 41
304527 2 33
305548 2 33
316313 0 32
347515 0 28
401515 4 64
408600 0 26
408824 2 37
508570 1 43
509314 4 82
510417 2 39
520353 3 36
786329 5 132
805284 3 56
831269 0 35
951225 2 29

That is an interesting collection. 786329 really stands out. We'll keep this as a categorical feature for our classifier.
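
In case it helps later (a hedged sketch; how this feature actually gets encoded is left to the modeling notebook), pandas can expand the code into indicator columns directly:

# One-hot encode the area+office code; with hundreds of distinct codes this
# frame gets wide fast, so thresholding to the interesting codes first may be wiser
office_dummies = pd.get_dummies(labeled_data["TO_OFFICE_CODE"], prefix="TO_OFFICE")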

It's about time


In [42]:
labeled_data["DATE/TIME"]


Out[42]:
0    2014-04-01
1    2014-04-01
2    2014-04-01
3    2014-04-01
4    2014-04-01
5    2014-04-01
6    2014-04-01
7    2014-04-01
8    2014-04-01
9    2014-04-01
10   2014-04-01
11   2014-04-01
12   2014-04-01
13   2014-04-01
14   2014-04-01
...
137348   2014-04-06 23:58:00
137349   2014-04-06 23:58:00
137350   2014-04-06 23:58:00
137351   2014-04-06 23:58:00
137352   2014-04-06 23:58:00
137353   2014-04-06 23:59:00
137354   2014-04-06 23:59:00
137355   2014-04-06 23:59:00
137356   2014-04-06 23:59:00
137357   2014-04-06 23:59:00
137358   2014-04-06 23:59:00
137359   2014-04-06 23:59:00
137360   2014-04-06 23:59:00
137361   2014-04-06 23:59:00
137362   2014-04-06 23:59:00
Name: DATE/TIME, Length: 137363, dtype: datetime64[ns]

In [43]:
# Extract Hour, Minute, and day of week
labeled_data["HOUR"] = labeled_data["DATE/TIME"].apply(lambda x: x.hour)
labeled_data["MINUTE"] = labeled_data["DATE/TIME"].apply(lambda x: x.minute)
labeled_data["DAYOFWEEK"] = labeled_data["DATE/TIME"].apply(lambda x: x.dayofweek)

In [38]:
explore_feature(labeled_data, "HOUR")


Out[38]:
LIKELY ROBOCALL False True
HOUR
0 4644 2143
1 3290 1201
2 2518 774
3 1185 305
4 638 36
5 582 8
6 379 7
7 335 9
8 321 2
9 375 4
10 409 1
11 872 18
12 2033 716
13 4043 1710
14 6354 3131
15 7648 3899
16 7692 4281
17 8104 4097
18 8436 4244
19 7637 4063
20 7418 4198
21 6617 3806
22 6075 3185
23 5161 2759

In [44]:
explore_feature(labeled_data, "MINUTE")


Out[44]:
LIKELY ROBOCALL False True
MINUTE
0 1674 703
1 1598 656
2 1615 679
3 1557 704
4 1565 752
5 1569 707
6 1643 738
7 1562 709
8 1577 754
9 1581 752
10 1566 763
11 1636 789
12 1681 740
13 1573 764
14 1543 753
15 1654 676
16 1645 781
17 1633 803
18 1591 790
19 1580 763
20 1552 728
21 1526 819
22 1528 752
23 1489 767
24 1612 707
25 1552 764
26 1496 754
27 1544 735
28 1467 761
29 1494 726
30 1621 763
31 1577 746
32 1518 716
33 1502 697
34 1477 808
35 1526 827
36 1552 787
37 1540 761
38 1544 800
39 1510 729
40 1531 754
41 1538 741
42 1525 724
43 1529 746
44 1471 769
45 1414 727
46 1509 799
47 1522 720
48 1528 741
49 1451 756
50 1523 753
51 1484 773
52 1495 686
53 1458 720
54 1544 771
55 1495 739
56 1457 661
57 1536 711
58 1556 727
59 1530 656

In [45]:
labeled_data["INTERVAL"] = pd.cut(labeled_data["MINUTE"], bins=range(-1,61,15), include_lowest=True)
explore_feature(labeled_data, "INTERVAL")


Out[45]:
LIKELY ROBOCALL False True
INTERVAL
(14, 29] 23363 11326
(29, 44] 22961 11368
(44, 59] 22502 10940
[-1, 14] 23940 10963

Minutes probably need to be paired up with the hour.


In [46]:
labeled_data["TIMECHUNK"] = labeled_data["DATE/TIME"].apply(lambda x: x.hour + np.floor(4*(x.minute/60.0))/4)

In [47]:
explore_feature(labeled_data, "TIMECHUNK")


Out[47]:
LIKELY ROBOCALL False True
TIMECHUNK
0.00 1286 578
0.25 1209 585
0.50 1121 546
0.75 1028 434
1.00 870 295
1.25 838 280
1.50 870 311
1.75 712 315
2.00 854 204
2.25 811 212
2.50 449 187
2.75 404 171
3.00 329 85
3.25 297 84
3.50 319 70
3.75 240 66
4.00 159 22
4.25 170 10
4.50 164 1
4.75 145 3
5.00 195 2
5.25 135 2
5.50 122 4
5.75 130 0
6.00 88 1
6.25 92 2
6.50 101 1
6.75 98 3
7.00 85 2
7.25 79 1
... ... ...
16.50 1912 1051
16.75 1869 1028
17.00 1938 1032
17.25 2047 961
17.50 1958 1024
17.75 2161 1080
18.00 2143 1041
18.25 2103 1055
18.50 2066 1136
18.75 2124 1012
19.00 2064 1008
19.25 1891 1039
19.50 1849 1014
19.75 1833 1002
20.00 2036 1166
20.25 1781 1126
20.50 1841 977
20.75 1760 929
21.00 1709 879
21.25 1628 1006
21.50 1615 972
21.75 1665 949
22.00 1582 860
22.25 1516 875
22.50 1532 785
22.75 1445 665
23.00 1441 655
23.25 1353 692
23.50 1197 717
23.75 1170 695

96 rows × 2 columns

Quite similar to the hour curve, but clearly more granular.

Is there a way to encode the fact that hours wrap around? Can my classifier be made to care that this isn't bounded at 0 and 24, that there is a modulus involved?
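
One standard trick (a sketch, not applied in this notebook) is to project the cyclic value onto a circle with sine and cosine, so that 23.75 and 0.00 land next to each other:

# Distances between (TIME_SIN, TIME_COS) points respect the wrap-around at midnight
labeled_data["TIME_SIN"] = np.sin(2 * np.pi * labeled_data["TIMECHUNK"] / 24.0)
labeled_data["TIME_COS"] = np.cos(2 * np.pi * labeled_data["TIMECHUNK"] / 24.0)

For a tree-based model this matters less, since trees split on thresholds rather than distances, but it helps anything distance-based.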


In [48]:
explore_feature(labeled_data, "DAYOFWEEK")


Out[48]:
LIKELY ROBOCALL False True
DAYOFWEEK
1 19402 9437
2 21898 10250
3 19926 10455
4 17896 9145
5 8571 3874
6 5073 1436

Wait! Where is 0? That's some strong sampling bias if 0 (Monday) isn't even included...

UPDATE: I spoke with the group that produced the data and they accidentally got rid of Monday. I do have more data I could work with in the future.

Forest for the trees


In [49]:
def total_call_volume(df):
    # How many calls did each number place, and how many did it receive?
    # Series.map looks each number up in the per-number group sizes.
    df["NUM_FROM_CALLS"] = df["FROM"].map(df.groupby("FROM").size())
    df["NUM_TO_CALLS"] = df["TO"].map(df.groupby("TO").size())
    
total_call_volume(labeled_data)
total_call_volume(unlabeled_data)

In [50]:
explore_feature(labeled_data, "NUM_FROM_CALLS")


Out[50]:
LIKELY ROBOCALL False True
NUM_FROM_CALLS
1 23207 384
2 10834 592
3 7347 582
4 5444 740
5 3820 690
6 3402 666
7 2177 546
8 1840 544
9 1800 531
10 1530 620
11 1012 506
12 1692 336
13 975 494
14 770 462
15 900 525
16 672 528
17 714 425
18 792 522
19 551 304
20 440 600
21 252 567
22 418 286
23 552 391
24 504 336
25 275 400
26 364 442
27 351 486
28 252 420
29 348 290
30 450 420
... ... ...
274 0 274
285 0 285
289 0 289
292 0 292
341 0 341
350 0 350
351 0 351
355 0 355
361 361 0
366 0 366
371 0 371
379 0 379
386 386 0
397 0 397
411 411 411
417 0 417
436 436 0
443 0 443
477 477 0
480 0 480
488 0 488
590 0 590
645 0 645
674 0 674
858 0 858
915 915 0
945 0 945
1140 0 1140
1566 1566 0
3203 3203 0

176 rows × 2 columns

Summary

At this point, we've explored a handful of features and can build a model from them with simple tools. Let's use those features to create a first-pass Random Forest classifier in Modeling Rachel the Robocaller.
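
As a teaser of where that's headed (a hedged sketch; the feature list here is illustrative, not the final selection), assembling the engineered columns for scikit-learn could look like:

# Label-encode the categorical columns built above and stack them with numeric ones
cat_features = ["FROM_AREA_CODE", "TO_OFFICE_CODE", "FROM_TIMEZONE"]
encoded = [preprocessing.LabelEncoder().fit_transform(labeled_data[c]) for c in cat_features]
numeric = labeled_data[["HOUR", "DAYOFWEEK", "NUM_FROM_CALLS"]].values
X = np.column_stack(encoded + [numeric])
y = labeled_data["LIKELY ROBOCALL"].values

clf = RandomForestClassifier(n_estimators=100).fit(X, y)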